@dchigarev dchigarev commented Nov 15, 2024

It was discovered that linalg.matmul_transpose_b is lowered to xegpu incorrectly in the case of large tiles. The issue is caused by a std::swap placed inside a nested for-loop, which, instead of swapping rowOffs and colOffs only once, performs the swap in every iteration, resulting in incorrect offsets.

Comment on lines -709 to -711
if (transpose) {
std::swap(newRowOffs, newColOffs);
}

Placing std::swap inside a nested for-loop was a bad idea: it swaps the values on every iteration, so the offsets end up nonsensical.
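To illustrate the class of bug, here is a minimal standalone sketch. It is not the actual lowering code: the function names, the `Offsets` alias, and the single-loop structure are all hypothetical, and the buggy variant is not meant to reproduce the exact bad offsets seen in the test. It only shows how swapping loop-carried offsets on every iteration corrupts them, while deciding the stepping axis once before the loop does not.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical (row, col) offset pair, used only for this illustration.
using Offsets = std::pair<int, int>;

// Buggy pattern: rowOff/colOff are carried across iterations, so the swap
// inside the loop flips them back and forth instead of applying once.
std::vector<Offsets> buggyTileOffsets(int numTiles, int step, bool transpose) {
  std::vector<Offsets> out;
  int rowOff = 0, colOff = 0;
  for (int t = 0; t < numTiles; ++t) {
    if (transpose)
      std::swap(rowOff, colOff); // mutates loop-carried state every iteration
    out.emplace_back(rowOff, colOff);
    colOff += step; // advance along the logical column axis
  }
  return out;
}

// Fixed pattern: the transposition is resolved once, before the loop, by
// choosing which axis the tiles are stepped along.
std::vector<Offsets> fixedTileOffsets(int numTiles, int step, bool transpose) {
  std::vector<Offsets> out;
  int rowStep = 0, colStep = step;
  if (transpose)
    std::swap(rowStep, colStep); // swapped exactly once
  for (int t = 0; t < numTiles; ++t)
    out.emplace_back(t * rowStep, t * colStep);
  return out;
}
```

With `numTiles = 4`, `step = 16` and `transpose = true`, the fixed variant yields (0, 0), (16, 0), (32, 0), (48, 0), which is the shape of the offsets the updated CHECK lines expect, whereas the buggy variant in this sketch drifts to (0, 0), (16, 0), (16, 16), (32, 16).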

Signed-off-by: dchigarev <[email protected]>
%subview_1 = memref.subview %arg0[%arg3, 0] [32, 1024] [1, 1] : memref<1024x1024xf16> to memref<32x1024xf16, strided<[1024, 1], offset: ?>>
%subview_2 = memref.subview %arg1[%arg4, 0] [32, 1024] [1, 1] : memref<1024x1024xf16> to memref<32x1024xf16, strided<[1024, 1], offset: ?>>
linalg.matmul_transpose_b ins(%subview_1, %subview_2 : memref<32x1024xf16, strided<[1024, 1], offset: ?>>, memref<32x1024xf16, strided<[1024, 1], offset: ?>>) outs(%subview_0 : memref<32x32xf16, strided<[1024, 1], offset: ?>>)
scf.parallel (%arg3, %arg4) = (%c0, %c0) to (%c1024, %c1024) step (%c16, %c64) {

Increased the tile size along the Y axis to test the problematic case.

Comment on lines 45 to +48
// CHECK: %[[tB:.+]] = xegpu.update_nd_offset %[[rootB]], [%c0, %c0]
// CHECK: %[[tB1:.+]] = xegpu.update_nd_offset %[[rootB]], [%c16, %c0]
// CHECK: %[[tB2:.+]] = xegpu.update_nd_offset %[[rootB]], [%c32, %c0]
// CHECK: %[[tB3:.+]] = xegpu.update_nd_offset %[[rootB]], [%c48, %c0]

Before the fix, it used to produce something like:

xegpu.update_nd_offset %[[rootB]], [%c0, %c0]
xegpu.update_nd_offset %[[rootB]], [%c16, %c0]
xegpu.update_nd_offset %[[rootB]], [%c32, %c16]
xegpu.update_nd_offset %[[rootB]], [%c16, %c32]

@dchigarev dchigarev marked this pull request as ready for review November 15, 2024 15:15
@dchigarev dchigarev merged commit 8b64109 into intel:main Nov 18, 2024
6 checks passed

3 participants